source("utils.R")
theme_set(theme_minimal())
Until now, we have represented the cells as a point pattern. However, as cells have a shape and an area, this can be an oversimplification in some cases. Alternatively, we can rely on the segmentations of individual cells, which are available for various datasets. The outline of each cell is represented by a polygon, and the collection of all cells can be seen as an irregular lattice. Unlike in a regular lattice (e.g., spot-based spatial transcriptomics data), the sample areas in an irregular lattice can have different sizes and are not necessarily regularly distributed over the sample space.
For this representation of the cells we will rely on the SpatialFeatureExperiment package. For preprocessing of the dataset we refer the reader to the vignette of the Voyager package (Moses et al. 2023). The Voyager package also provides wrapper functions around the package spdep (Pebesma and Bivand 2023) that work directly on the SpatialFeatureExperiment object.
class: SpatialFeatureExperiment
dim: 980 100290
metadata(0):
assays(1): counts
rownames(980): AATK ABL1 ... NegPrb22 NegPrb23
rowData names(3): means vars cv2
colnames(100290): 1_1 1_2 ... 30_4759 30_4760
colData names(17): Area AspectRatio ... nCounts nGenes
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
spatialCoords names(2) : CenterX_global_px CenterY_global_px
imgData names(1): sample_id
unit: full_res_image_pixels
Geometries:
colGeometries: centroids (POINT), cellSeg (POLYGON)
Graphs:
sample01:
[1] 20
For some examples we will show a subset of the tissue.
In this vignette, we will show the metrics related to two marker genes: KRT17 (basal cells) and TAGLN (smooth muscle cells).
One of the challenges when working with (irregular) lattice data is the construction of a neighbourhood graph (Pebesma and Bivand 2023). The main question is what to consider as neighbours, as this choice affects downstream analyses. Various methods exist to define neighbours, such as contiguity-based neighbours (neighbours in direct contact), graph-based neighbours (e.g., \(k\)-nearest neighbours), distance-based neighbours or higher-order neighbours (Getis 2009; Zuur, Ieno, and Smith 2007; Pebesma and Bivand 2023). The documentation of the package spdep gives an overview of the different methods.
Segmentation of individual cells is challenging (Wang 2019), and the construction of contiguity-based neighbours from individual cell segmentations assumes very accurate segmentation results. Furthermore, it would neglect the influence of more distant, not directly adjacent neighbours, which, depending on the feature of interest, might not be a valid assumption.
In an irregular lattice, the task of finding a spatial weight matrix is more complex, as different options exist. One option is to base the neighbourhood graph on neighbours that are in direct contact with each other (contiguous), as implemented in the poly2nb method. As cell segmentation is notoriously imperfect, we add a snap value, which means that we consider all cells with distance 20 or less as contiguous.
Alternatively, we can use a k-nearest neighbours approach. The number \(k\) is somewhat arbitrary.
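To make the k-nearest neighbours idea concrete, here is a minimal base R sketch that builds a binary weight matrix from centroid coordinates. The coordinates and the value of `k` are hypothetical toy values, not taken from the dataset; in practice the spdep functions `knearneigh()` and `knn2nb()` serve this purpose.

```r
# Toy centroids (hypothetical coordinates, not taken from the dataset)
coords <- cbind(x = c(0, 1, 3, 7, 12),
                y = c(0, 0, 0, 0, 0))
k <- 2  # number of neighbours; an arbitrary choice for illustration

# Pairwise Euclidean distances between all centroids
d <- as.matrix(dist(coords))
diag(d) <- Inf  # a cell is not its own neighbour

# For each cell, the indices of its k nearest cells
n <- nrow(coords)
nb <- lapply(seq_len(n), function(i) order(d[i, ])[seq_len(k)])

# Binary weight matrix: w_ij = 1 if j is among the k nearest neighbours of i
W_knn <- matrix(0, n, n)
for (i in seq_len(n)) W_knn[i, nb[[i]]] <- 1

rowSums(W_knn)      # every cell has exactly k neighbours
isSymmetric(W_knn)  # FALSE here: being a nearest neighbour is not mutual
```

In this toy example, cell 3 is among the two nearest neighbours of cell 4, but not the other way round. This asymmetry is one way in which a kNN graph differs from a contiguity graph, where the neighbour relation is mutual.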
The graphs below show noticeable differences. In the contiguous neighbour graph on the left (neighbours in direct contact), we can see the formation of distinct patches that are not connected to the rest of the tissue. In addition some cells don’t have any direct neighbours. In contrast, the k-nearest neighbours (kNN) graph on the right reveals that these patches tend to be connected to the rest of the structure.
Here we set the arguments for the examples below.
Global methods give us an overview over the entire field-of-view and summarize the spatial autocorrelation metric to a single value. The metrics are a function of the weight matrix and the variables of interest. The variables of interest can be gene expression, intensity of a marker or the area of the cell. The global measures can be seen as a weighted average of the local metric, as explained below.
In general, a global spatial autocorrelation measure has the form of a double sum over all locations \(i,j\)
\[\sum_i \sum_j f(x_i,x_j) w_{ij}\]
where \(f(x_i,x_j)\) is the measure of association between features of interest and \(w_{ij}\) scales the relationship by a spatial weight as defined in the weight matrix \(W\). If \(i\) and \(j\) are not neighbours, i.e. we assume they do not have any spatial association, the corresponding element of the weights matrix is 0 (i.e., \(w_{ij} = 0\)). In the following we will see that the function \(f\) varies between the different spatial autocorrelation measures (Zuur, Ieno, and Smith 2007; Pebesma and Bivand 2023).
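As a compact summary based on the definitions used in this section, the three global measures correspond to the following choices of \(f\); each measure additionally divides the double sum by its own normalising term:

\[
\begin{aligned}
\text{Moran's } I: &\quad f(x_i,x_j) = (x_i-\bar{x})(x_j-\bar{x}) \\
\text{Geary's } C: &\quad f(x_i,x_j) = (x_i-x_j)^2 \\
\text{Getis-Ord } G: &\quad f(x_i,x_j) = x_i x_j
\end{aligned}
\]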
The global Moran’s I (Moran 1950) coefficient is a measure of spatial autocorrelation, defined as:
\[I = \frac{n}{\sum_i\sum_j w_{ij}} \frac{\sum_i\sum_j w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2}\]
where \(x_i\) and \(x_j\) represent the values of the variable of interest at locations \(i\) and \(j\), \(\bar{x}\) is the mean of all \(x\), \(n\) is the total number of locations and \(w_{ij}\) is the spatial weight between locations \(i\) and \(j\). The expected value under no spatial autocorrelation is \(\mathbb{E}(I) = -1/(n-1)\), which is close to \(0\) for large \(n\). A value higher than the expected value indicates positive spatial auto-correlation, whereas a lower value indicates negative auto-correlation.
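The formula can be checked on a small example. The following base R sketch implements the definition directly; the chain-shaped neighbourhood and the values are hypothetical toy data, and `moran_i` is an illustrative helper rather than a function from spdep or Voyager.

```r
# Global Moran's I computed directly from the definition
moran_i <- function(x, W) {
  n  <- length(x)
  dx <- x - mean(x)  # deviations from the global mean
  (n / sum(W)) * sum(W * outer(dx, dx)) / sum(dx^2)
}

# Toy neighbourhood: six locations in a chain, binary weights
n <- 6
W <- matrix(0, n, n)
for (i in seq_len(n - 1)) W[i, i + 1] <- W[i + 1, i] <- 1

x_clustered   <- c(1, 1, 1, 5, 5, 5)  # similar values sit next to each other
x_alternating <- c(1, 5, 1, 5, 1, 5)  # neighbours always differ

moran_i(x_clustered, W)    # 0.6, above the expected value
moran_i(x_alternating, W)  # -1, below the expected value
-1 / (n - 1)               # expected value under no autocorrelation: -0.2
```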
DataFrame with 2 rows and 2 columns
moran K
<numeric> <numeric>
KRT17 0.725630 3.79983
TAGLN 0.282208 7.98489
We can also use the moran.mc function to calculate the Moran’s I coefficient. This function uses a Monte Carlo simulation to calculate the p-value.
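The idea behind the Monte Carlo p-value can be sketched in base R: the observed values are randomly reassigned to locations many times, and the observed statistic is compared with the resulting reference distribution. This is a simplified illustration on hypothetical toy data, not the spdep implementation; `moran_i`, the chain neighbourhood and `nsim` are assumptions made for this example.

```r
set.seed(1)

# Global Moran's I computed directly from the definition
moran_i <- function(x, W) {
  n  <- length(x)
  dx <- x - mean(x)
  (n / sum(W)) * sum(W * outer(dx, dx)) / sum(dx^2)
}

# Toy neighbourhood: 20 locations in a chain, binary weights
n <- 20
W <- matrix(0, n, n)
for (i in seq_len(n - 1)) W[i, i + 1] <- W[i + 1, i] <- 1

x <- rep(c(0, 1), each = 10)  # strongly clustered toy values
nsim <- 199                   # number of random permutations

i_obs  <- moran_i(x, W)
i_perm <- replicate(nsim, moran_i(sample(x), W))  # shuffle values over locations

# One-sided p-value (alternative "greater"); the observed value counts as one draw
p_value <- (sum(i_perm >= i_obs) + 1) / (nsim + 1)
p_value
```

With this construction the smallest attainable p-value is \(1/(\text{nsim}+1)\). The p-value of 0.00497512 reported for both genes equals \(1/201\), which would be consistent with 200 simulations in which no permuted statistic reached the observed one.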
DataFrame with 2 rows and 10 columns
means vars cv2 is_neg moran.mc_statistic_sample01
<numeric> <numeric> <numeric> <logical> <numeric>
KRT17 1.378333 14.67222 7.72303 FALSE 0.725630
TAGLN 0.714079 3.76205 7.37788 FALSE 0.282208
moran.mc_parameter_sample01 moran.mc_p.value_sample01
<numeric> <numeric>
KRT17 201 0.00497512
TAGLN 201 0.00497512
moran.mc_alternative_sample01 moran.mc_method_sample01
<character> <character>
KRT17 greater Monte-Carlo simulati..
TAGLN greater Monte-Carlo simulati..
moran.mc_res_sample01
<list>
KRT17 -0.00403813, 0.01123295, 0.00700654,...
TAGLN -0.00331296,-0.00455799,-0.00374588,...
[1] 0.004975124 0.004975124
[1] "Monte-Carlo simulation of Moran I" "Monte-Carlo simulation of Moran I"
We can see that both genes have a positive Moran’s I coefficient and a highly significant p-value. The expected value is \(\mathbb{E}(I) = -1/(n-1)\), which is close to 0 for large \(n\). Positive and significant values indicate that areas with similar values are clustered. It is important to note that this could be at either the high or the low end of the values of interest. Negative values indicate clustering of alternating values, i.e., they give a measure of spatial heterogeneity. Moreover, one should note that the result depends on the weight matrix: different weight matrices will give different results. To compare Moran’s I coefficients between different features, we need to use the same weight matrix.
Geary’s \(C\) (Geary 1954) is a different measure of global autocorrelation and is very closely related to Moran’s \(I\). However, it focuses on spatial dissimilarity. Geary’s \(C\) is defined by
\[C = \frac{(n-1) \sum_i \sum_j w_{ij}(x_i-x_j)^2}{2\sum_i \sum_j w_{ij}\sum_i(x_i-\bar{x})^2}\]
where \(x_i\) and \(x_j\) represent the values of the variable of interest at locations \(i\) and \(j\), \(\bar{x}\) is the mean of all \(x\), \(w_{ij}\) is the spatial weight between locations \(i\) and \(j\), and \(n\) is the total number of locations. The interpretation is opposite to Moran’s \(I\): a value smaller than \(1\) indicates positive auto-correlation whereas a value greater than \(1\) represents negative auto-correlation.
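As with Moran's \(I\), the definition can be verified on a small example. The sketch below uses hypothetical toy data on a chain of six locations; `geary_c` is an illustrative helper, not a function from spdep or Voyager.

```r
# Global Geary's C computed directly from the definition
geary_c <- function(x, W) {
  n <- length(x)
  sq_diff <- outer(x, x, "-")^2  # (x_i - x_j)^2 for all pairs
  ((n - 1) * sum(W * sq_diff)) / (2 * sum(W) * sum((x - mean(x))^2))
}

# Toy neighbourhood: six locations in a chain, binary weights
n <- 6
W <- matrix(0, n, n)
for (i in seq_len(n - 1)) W[i, i + 1] <- W[i + 1, i] <- 1

x_clustered   <- c(1, 1, 1, 5, 5, 5)
x_alternating <- c(1, 5, 1, 5, 1, 5)

geary_c(x_clustered, W)    # 1/3: below 1, positive auto-correlation
geary_c(x_alternating, W)  # 5/3: above 1, negative auto-correlation
```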
DataFrame with 2 rows and 16 columns
means vars cv2 is_neg moran.mc_statistic_sample01
<numeric> <numeric> <numeric> <logical> <numeric>
KRT17 1.378333 14.67222 7.72303 FALSE 0.725630
TAGLN 0.714079 3.76205 7.37788 FALSE 0.282208
moran.mc_parameter_sample01 moran.mc_p.value_sample01
<numeric> <numeric>
KRT17 201 0.00497512
TAGLN 201 0.00497512
moran.mc_alternative_sample01 moran.mc_method_sample01
<character> <character>
KRT17 greater Monte-Carlo simulati..
TAGLN greater Monte-Carlo simulati..
moran.mc_res_sample01 geary.mc_statistic_sample01
<list> <numeric>
KRT17 -0.00403813, 0.01123295, 0.00700654,... 0.270655
TAGLN -0.00331296,-0.00455799,-0.00374588,... 0.713507
geary.mc_parameter_sample01 geary.mc_p.value_sample01
<numeric> <numeric>
KRT17 1 0.00497512
TAGLN 1 0.00497512
geary.mc_alternative_sample01 geary.mc_method_sample01
<character> <character>
KRT17 greater Monte-Carlo simulati..
TAGLN greater Monte-Carlo simulati..
geary.mc_res_sample01
<list>
KRT17 1.008629,0.987882,0.992398,...
TAGLN 1.00121,1.00322,0.99916,...
[1] 0.004975124 0.004975124
[1] "Monte-Carlo simulation of Moran I" "Monte-Carlo simulation of Moran I"
Again, the value of Geary’s \(C\) indicates that the genes are spatially auto-correlated.
The global \(G\) (Getis and Ord 1992) statistic is a generalisation of the local version (see below) and summarises the contributions of all pairs of values \((x_i, x_j)\) in the dataset. Formally that is
\[G(d) = \frac{\sum_{i = 1}^n \sum_{j=1}^n w_{ij}(d)x_ix_j}{\sum_{i = 1}^n \sum_{j=1}^n x_i x_j} \quad \text{s.t. } j \neq i.\]
The global \(G(d)\) statistic is very similar to global Moran’s \(I\). The global \(G(d)\) statistic is based on the sum of the products of the datapoints whereas global Moran’s \(I\) is based on the sum of the covariances. Since these two approaches capture different aspects of a structure, their values will differ as well. A good approach would be to not use one statistic in isolation but rather consider both.
It is recommended to use binary weights for this calculation. We will use the spdep package directly to calculate the global \(G\) statistic.
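For intuition, the statistic can be computed by hand on a small example with binary weights. The sketch below uses hypothetical toy data; `global_g` is an illustrative helper, and the null expectation \(\sum_i\sum_j w_{ij} / (n(n-1))\) is the one given by Getis and Ord (1992).

```r
# Global G computed directly from the definition (pairs with j == i are excluded)
global_g <- function(x, W) {
  xx <- outer(x, x)  # all products x_i * x_j
  diag(xx) <- 0      # drop the terms with j == i
  diag(W) <- 0
  sum(W * xx) / sum(xx)
}

# Toy neighbourhood: six locations in a chain, binary weights
n <- 6
W <- matrix(0, n, n)
for (i in seq_len(n - 1)) W[i, i + 1] <- W[i + 1, i] <- 1

x_clustered   <- c(1, 1, 1, 5, 5, 5)  # high values are adjacent
x_alternating <- c(1, 5, 1, 5, 1, 5)  # high values are never adjacent

g_expected <- sum(W) / (n * (n - 1))  # expectation under no spatial association

global_g(x_clustered, W)    # larger than g_expected
global_g(x_alternating, W)  # smaller than g_expected
g_expected
```

A value above the expectation points to a concentration of high values among neighbours, which matches the one-sided alternative ("greater") shown in the test output.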
Getis-Ord global G statistic
data: counts(sfe)[features[1], ]
weights: weights_neighbourhoods_binary
standard deviate = 93.537, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Global G statistic Expectation Variance
4.138026e-04 9.725734e-05 1.145260e-11
Unlike global measures that give an overview over the entire field of view, local measures report information about the statistic at each location (cell). There exist local analogs of Moran’s I and Geary’s C for which the global statistic can be represented as a weighted sum of the local statistics. As above, the local coefficients are based on both the spatial weights matrix and the values of the measurement of interest.
The local Moran’s I coefficient (Anselin 1995) is a measure of spatial autocorrelation on each location of interest. It is defined as:
\[I_i = \frac{x_i - \bar{x}}{\sum_{k=1}^n(x_k-\bar{x})^2/(n-1)} \sum_{j=1}^n w_{ij}(x_j - \bar{x})\]
where the index \(i\) refers to the location for which the measure is calculated. The interpretation is analogous to the global Moran’s I: a value of \(I_i\) higher than \(\mathbb{E}(I) = -1/(n-1)\) indicates positive spatial auto-correlation, whereas smaller values indicate negative auto-correlation. It is important to note that, as for the global counterpart, a high value of local Moran’s I can result from either the high or the low end of the values. Since we measure and test a large number of locations simultaneously, we need to correct for multiple testing (e.g., using the Benjamini-Hochberg procedure).
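The link between the local and the global statistic can be illustrated in base R. The sketch below uses hypothetical toy data; `local_moran` is an illustrative helper and the p-values passed to `p.adjust()` are made up solely to show the Benjamini-Hochberg call.

```r
# Local Moran's I computed directly from the definition
local_moran <- function(x, W) {
  n  <- length(x)
  dx <- x - mean(x)
  s2 <- sum(dx^2) / (n - 1)         # variance term in the denominator
  as.vector(dx / s2 * (W %*% dx))   # one value per location
}

# Toy neighbourhood: six locations in a chain, binary weights
n <- 6
W <- matrix(0, n, n)
for (i in seq_len(n - 1)) W[i, i + 1] <- W[i + 1, i] <- 1

x  <- c(1, 1, 1, 5, 5, 5)
Ii <- local_moran(x, W)
Ii

# The global Moran's I is a rescaled sum of the local values
n / (sum(W) * (n - 1)) * sum(Ii)  # 0.6, the global Moran's I for these data

# Multiple testing correction on hypothetical per-location p-values
p_raw <- c(0.001, 0.01, 0.02, 0.04, 0.5, 0.9)
p_adj <- p.adjust(p_raw, method = "BH")
```

Note that locations 3 and 4, which sit on the boundary between the low and the high region, have a local value of zero, while locations inside a homogeneous region contribute positively.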
Similar to local Moran’s I, there is a local Geary’s C (Anselin 1995) coefficient. It is defined as
\[C_i = \sum_{j=1}^n w_{ij}(x_i-x_j)^2\]
The interpretation is analogous to the global Geary’s C (value less than \(1\) indicates positive auto-correlation, a value greater than \(1\) highlights negative auto-correlation).
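A minimal base R sketch of the definition follows, using hypothetical toy data; `local_geary` is an illustrative helper. As written, \(C_i\) is unscaled, so the sketch also shows the rescaling that recovers the global Geary's \(C\) from the local values.

```r
# Local Geary's C computed directly from the definition
local_geary <- function(x, W) {
  rowSums(W * outer(x, x, "-")^2)  # sum_j w_ij * (x_i - x_j)^2
}

# Toy neighbourhood: six locations in a chain, binary weights
n <- 6
W <- matrix(0, n, n)
for (i in seq_len(n - 1)) W[i, i + 1] <- W[i + 1, i] <- 1

x  <- c(1, 1, 1, 5, 5, 5)
Ci <- local_geary(x, W)
Ci  # non-zero only at the boundary between the low and the high region

# Rescaled sum of the local values gives the global Geary's C
(n - 1) * sum(Ci) / (2 * sum(W) * sum((x - mean(x))^2))  # 1/3
```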
In this example, we will not plot the local Geary’s C coefficient for gene expression but for features that are associated with an individual cell, e.g., the number of counts or the number of genes expressed. For this, the colDataUnivariate function is used to calculate the local Geary’s C coefficient for such features.
The local Getis-Ord \(G_i\) (J. K. Ord and Getis 1995; Getis and Ord 1992) statistic quantifies the weighted concentration of points within a radius \(d\) and in a local region \(i\), according to:
\[G_i(d) = \frac{\sum_{j \neq i } w_{ij}(d)x_j}{\sum_{j \neq i} x_j}\]
There is a variant of this statistic, \(G_i^*(d)\), which is the same as \(G_i(d)\) except that the contribution when \(j=i\) is included in the term.
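The difference between the two variants can be seen in a short base R sketch on hypothetical toy data with binary weights; `local_g` and `local_g_star` are illustrative helpers. They return the raw ratios from the formula, whereas implementations such as `spdep::localG()` report a standardised version of the statistic, which is why reported values can be negative.

```r
# Local G_i: location i itself is excluded from numerator and denominator
local_g <- function(x, W) {
  diag(W) <- 0
  as.vector(W %*% x) / (sum(x) - x)
}

# Local G_i^*: location i itself is included (binary weights assumed)
local_g_star <- function(x, W) {
  diag(W) <- 1
  as.vector(W %*% x) / sum(x)
}

# Toy neighbourhood: six locations in a chain, binary weights
n <- 6
W <- matrix(0, n, n)
for (i in seq_len(n - 1)) W[i, i + 1] <- W[i + 1, i] <- 1

x <- c(1, 1, 1, 5, 5, 5)

local_g(x, W)       # larger for locations surrounded by high values
local_g_star(x, W)  # additionally counts the value at the location itself
```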
The results above give an estimate of the local Getis-Ord statistic for each cell, but no significance value. Significance can be assessed with a permutation approach by using the localG_perm argument.
Positive values indicate clustering of high values, i.e., hot spots, and negative values indicate clustering of low values, i.e., cold spots. The method does not detect outlier values because, unlike in local Moran’s I, there is no cross-product between \(i\) and \(j\). But unlike local Moran’s I, we know the type of interaction (high-high or low-low) between \(i\) and \(j\).
The local spatial heteroscedasticity (LOSH) is a measure of spatial autocorrelation that is based on the variance of the local neighbourhood. Unlike the other measures, this method does not assume homoscedastic variance over the whole tissue region. LOSH is defined as:
\[H_i(d) = \frac{\sum_j w_{ij}(d)|e_j(d)|^a}{\sum_j w_{ij}(d)}\]
where \(e_j(d) = x_j - \bar{x}_i(d), j \in N(i,d)\) are the local residuals, obtained by subtracting the local mean \(\bar{x}_i(d)\) from each value in the neighbourhood \(N(i,d)\). The power \(a\) modulates the interpretation of \(H_i\) (\(a=1\): \(H_i\) measures the absolute deviations from the local mean; \(a=2\): \(H_i\) measures the local variance).
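The definition can be traced step by step in base R. The sketch below follows the formula as written above, with residuals taken relative to the local mean of location \(i\); packaged implementations may define the residuals and the scaling slightly differently. The data are hypothetical and `losh` is an illustrative helper.

```r
# LOSH computed from the definition above (illustrative, not the spdep implementation)
losh <- function(x, W, a = 2) {
  vapply(seq_along(x), function(i) {
    nb <- which(W[i, ] > 0)                   # neighbourhood N(i, d)
    w  <- W[i, nb]
    local_mean <- sum(w * x[nb]) / sum(w)     # local mean around location i
    e  <- x[nb] - local_mean                  # local residuals e_j(d)
    sum(w * abs(e)^a) / sum(w)
  }, numeric(1))
}

# Toy neighbourhood: six locations in a chain, binary weights
n <- 6
W <- matrix(0, n, n)
for (i in seq_len(n - 1)) W[i, i + 1] <- W[i + 1, i] <- 1

x <- c(1, 1, 1, 5, 5, 5)

losh(x, W, a = 2)  # variance-type measure: 0 0 4 4 0 0
losh(x, W, a = 1)  # absolute-deviation measure: 0 0 2 2 0 0
```

Only locations 3 and 4, whose neighbourhoods straddle the low and the high region, show local heterogeneity; inside each homogeneous region the value is zero.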
The LOSH should be interpreted in combination with the local Getis-Ord \(G_i^*\) statistic. The \(G_i^*\) quantifies the local mean of the variable of interest, while \(H_i\) quantifies the local variance. This table provided by Ord and Getis (J. Keith Ord and Getis 2012) summarizes the interpretation of the combination of \(G_i^*\) and \(H_i\).
| | high \(H_i\) | low \(H_i\) |
|---|---|---|
| large \(\|G_i^*\|\) | A hot spot with heterogeneous local conditions | A hot spot with similar surrounding areas; the map would indicate whether the affected region is larger than the single “cell” |
| small \(\|G_i^*\|\) | Heterogeneous local conditions but at a low average level (an unlikely event) | Homogeneous local conditions and a low average level |
The local methods presented above should always be interpreted with care, since we face the problem of multiple testing when calculating them for each cell. Moreover, the presented methods should mainly serve as exploratory measures to identify interesting regions in the data. Multiple processes can lead to the same pattern, so identifying a pattern does not allow us to infer the underlying process. An indication of clustering does not explain why it occurs. On the one hand, clustering can be the result of spatial interaction between the variables of interest, e.g., an accumulation of a gene of interest in one region of the tissue. On the other hand, clustering can be the result of spatial heterogeneity, when local similarity is created by structural heterogeneity in the tissue, e.g., when cells with uniform expression of a gene of interest are grouped together, which then creates the apparent clustering of the gene expression measurement.
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Zurich
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] magrittr_2.0.3 stringr_1.5.0
[3] dixon_0.0-8 splancs_2.01-44
[5] spdep_1.2-8 spData_2.3.0
[7] tmap_3.3-4 scater_1.28.0
[9] scran_1.28.2 scuttle_1.10.3
[11] SFEData_1.2.0 SpatialFeatureExperiment_1.2.3
[13] Voyager_1.2.7 rgeoda_0.0.10-4
[15] digest_0.6.33 ncf_1.3-2
[17] sf_1.0-16 reshape2_1.4.4
[19] patchwork_1.2.0 STexampleData_1.8.0
[21] ExperimentHub_2.8.1 AnnotationHub_3.8.0
[23] BiocFileCache_2.8.0 dbplyr_2.3.4
[25] RANN_2.6.1 seg_0.5-7
[27] sp_2.1-1 rlang_1.1.1
[29] ggplot2_3.5.1 dplyr_1.1.3
[31] mixR_0.2.0 spatstat_3.0-6
[33] spatstat.linnet_3.1-1 spatstat.model_3.2-6
[35] rpart_4.1.19 spatstat.explore_3.2-3
[37] nlme_3.1-162 spatstat.random_3.1-6
[39] spatstat.geom_3.2-5 spatstat.data_3.0-1
[41] SpatialExperiment_1.10.0 SingleCellExperiment_1.22.0
[43] SummarizedExperiment_1.30.2 Biobase_2.60.0
[45] GenomicRanges_1.52.1 GenomeInfoDb_1.36.4
[47] IRanges_2.34.1 S4Vectors_0.38.2
[49] BiocGenerics_0.46.0 MatrixGenerics_1.12.3
[51] matrixStats_1.0.0
loaded via a namespace (and not attached):
[1] spatstat.sparse_3.0-2 bitops_1.0-7
[3] httr_1.4.7 RColorBrewer_1.1-3
[5] tools_4.3.1 utf8_1.2.3
[7] R6_2.5.1 HDF5Array_1.28.1
[9] mgcv_1.9-1 rhdf5filters_1.12.1
[11] withr_2.5.1 gridExtra_2.3
[13] leaflet_2.2.0 leafem_0.2.3
[15] cli_3.6.1 labeling_0.4.3
[17] proxy_0.4-27 R.utils_2.12.2
[19] dichromat_2.0-0.1 scico_1.5.0
[21] limma_3.56.2 rstudioapi_0.15.0
[23] RSQLite_2.3.1 generics_0.1.3
[25] crosstalk_1.2.0 Matrix_1.5-4.1
[27] ggbeeswarm_0.7.2 fansi_1.0.5
[29] abind_1.4-5 R.methodsS3_1.8.2
[31] terra_1.7-55 lifecycle_1.0.3
[33] yaml_2.3.7 edgeR_3.42.4
[35] rhdf5_2.44.0 tmaptools_3.1-1
[37] grid_4.3.1 blob_1.2.4
[39] promises_1.2.1 dqrng_0.3.1
[41] crayon_1.5.2 lattice_0.21-8
[43] beachmat_2.16.0 KEGGREST_1.40.1
[45] magick_2.8.0 pillar_1.9.0
[47] knitr_1.44 metapod_1.7.0
[49] rjson_0.2.21 boot_1.3-28.1
[51] codetools_0.2-19 wk_0.8.0
[53] glue_1.6.2 vctrs_0.6.4
[55] png_0.1-8 gtable_0.3.4
[57] cachem_1.0.8 xfun_0.40
[59] S4Arrays_1.0.6 mime_0.12
[61] DropletUtils_1.20.0 units_0.8-4
[63] statmod_1.5.0 bluster_1.10.0
[65] interactiveDisplayBase_1.38.0 ellipsis_0.3.2
[67] bit64_4.0.5 filelock_1.0.2
[69] irlba_2.3.5.1 vipor_0.4.5
[71] KernSmooth_2.23-21 colorspace_2.1-0
[73] DBI_1.1.3 raster_3.6-26
[75] tidyselect_1.2.0 bit_4.0.5
[77] compiler_4.3.1 curl_5.1.0
[79] BiocNeighbors_1.18.0 DelayedArray_0.26.7
[81] scales_1.3.0 classInt_0.4-10
[83] rappdirs_0.3.3 goftest_1.2-3
[85] spatstat.utils_3.0-5 rmarkdown_2.25
[87] XVector_0.40.0 htmltools_0.5.6.1
[89] pkgconfig_2.0.3 base64enc_0.1-3
[91] sparseMatrixStats_1.12.2 fastmap_1.1.1
[93] htmlwidgets_1.6.2 shiny_1.7.5.1
[95] DelayedMatrixStats_1.22.6 farver_2.1.1
[97] jsonlite_1.8.7 BiocParallel_1.34.2
[99] R.oo_1.25.0 BiocSingular_1.16.0
[101] RCurl_1.98-1.12 GenomeInfoDbData_1.2.10
[103] s2_1.1.4 Rhdf5lib_1.22.1
[105] munsell_0.5.0 Rcpp_1.0.11
[107] ggnewscale_0.4.9 viridis_0.6.4
[109] stringi_1.7.12 leafsync_0.1.0
[111] zlibbioc_1.46.0 plyr_1.8.9
[113] parallel_4.3.1 ggrepel_0.9.4
[115] deldir_1.0-9 Biostrings_2.68.1
[117] stars_0.6-4 splines_4.3.1
[119] tensor_1.5 locfit_1.5-9.8
[121] igraph_1.5.1 ScaledMatrix_1.8.1
[123] BiocVersion_3.17.1 XML_3.99-0.14
[125] evaluate_0.22 BiocManager_1.30.22
[127] httpuv_1.6.11 purrr_1.0.2
[129] polyclip_1.10-6 scattermore_1.2
[131] rsvd_1.0.5 lwgeom_0.2-13
[133] xtable_1.8-4 e1071_1.7-13
[135] RSpectra_0.16-1 later_1.3.1
[137] viridisLite_0.4.2 class_7.3-22
[139] tibble_3.2.1 memoise_2.0.1
[141] beeswarm_0.4.0 AnnotationDbi_1.62.2
[143] cluster_2.1.4
©2024 The pasta authors. Content is published under Creative Commons CC-BY-4.0 License for the text and GPL-3 License for any code.